n 15« a«i 

DOCUMENT RESUME

CS 004 061

AUTHOR       Moe, Alden J.; Hopkins, Carol J.
TITLE        Parsing Word Strings from Text with a Computer:
             Implications for Reading Instruction.
PUB DATE     May 78
NOTE         14p.; Paper presented at the Annual Meeting of the
             International Reading Association (23rd, Houston,
             Texas, May 1-5, 1978); For related document see
             CS 004 062

EDRS PRICE   MF-$0.83 HC-$1.67 Plus Postage.
DESCRIPTORS  Beginning Reading; *Computers; *Content Analysis;
             Elementary Education; *Phrase Structure; *Reading
             Instruction; Reading Skills; Sight Vocabulary;
             Syntax; *Word Frequency; *Word Recognition
IDENTIFIERS  *Word Strings

ABSTRACT



Compilation of a list of the most common phrases used in reading
was begun with the rationale that the quick recognition of phrases
would facilitate reading comprehension. The first effort showed that
categorizing phrases by parts of speech did not provide acceptable
levels of accuracy. The system that was effective, however, used a
computer program that recorded every consecutive two- and three-word
sequence in the text sample and determined which of these word
strings recurred most frequently. The computer program makes possible
samplings of large amounts of text (50,000 words or more), thus
eliminating the idiosyncrasies of text sampling. The researchers who
developed this system believe that the common phrases it identifies
should be taught in much the same manner as common words are now
taught in beginning reading instruction. (RL)



Reproductions supplied by EDRS are the best that can be made
from the original document.



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL
INSTITUTE OF EDUCATION POSITION OR POLICY.

Alden J. Moe
Education
Purdue University
West Lafayette, IN 47907



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY

Alden J. Moe

Carol J. Hopkins

TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC) AND
USERS OF THE ERIC SYSTEM."



PARSING WORD STRINGS FROM TEXT WITH A COMPUTER:
IMPLICATIONS FOR READING INSTRUCTION

Alden J. Moe and Carol J. Hopkins¹

Purdue University

A paper presented at the twenty-third annual convention of the
International Reading Association, Houston, May 4, 1978



The ability to quickly associate meaning with units of written
language is considered crucial to the comprehension of text (Smith,
1971). Among the units of written language the reader must process
are individual words, phrases, clauses, sentences and discourse
structures. Word lists have been compiled for reading instruction
with the criterion that the words be common, and that these common
words be taught early in reading instruction; several such word lists
have become popular and are widely used by classroom teachers
(Thorndike and Lorge, 1944; Dale and Chall, 1948; Carroll, Davies,
and Richman, 1971; Harris and Jacobson, 1972). Lists of common
phrases (or common word strings), however, are not found in the
research literature or in instructional materials (basal readers,
manuals, workbooks, etc.) even though it is believed that the quick
recognition of phrases will facilitate comprehension. The only
available list of common phrases is the one compiled by Dolch (1948)
thirty years ago.

¹ The authors gratefully acknowledge the help of Lee Congdon, who
developed the computer program discussed in this report, and Robert
Hieb, who made many trial runs in the process of the program
development.

The purpose of this report is to provide a rationale, a
justification, for the need to identify common word strings in text.
This justification will touch upon some theories of language and/or
reading processing and present our implications for reading
instruction. In addition, we will describe some of the stages that
brought us to the point where we felt we could actually parse word
strings from text with a computer and identify the most common.

Both "word strings" and "phrases" have been used to this point to
indicate word groups where the words appear together in text. A more
precise definition of intra-sentence word groups such as phrases,
clauses and strings is deferred until a following section.



Significance of the Problem

The more automatic the recognition of the chunks of language being
read and the less effort expended on decoding, the greater the
likelihood of complete comprehension. LaBerge and Samuels (1974) and
Samuels (1976) refer to this as automatic decoding or automaticity.
Samuels (1976) states that "in order to have both fluent reading and
good comprehension, the student must go beyond accuracy to
automaticity in decoding" (p. 323).



In other words, the reader has a limited amount of cognitive energy,
or ability, or memory with which to accomplish the reading task; the
more cognitive energy used for decoding, the less for comprehension.
The development of automaticity probably begins at the word level,
but LaBerge and Samuels state that if the reader

    begins to organize some of the words into short groups or phrases
    as he reads, then further repetitions can strengthen these units
    as well as word units. In this way he can break through
    word-by-word reading and apply the benefits of further
    repetitions to automatization of larger units. (p. 315).

The importance of "phrase reading" over "word reading" is
demonstrated by noting differences in the fixation length of naive
and fluent readers. For example, first-grade children may make two
fixations per word whereas high-school seniors make one fixation for
about every two words (Taylor, Frackenpohl, and Pettee, 1960). And in
a study of third- and sixth-grade readers, Rode (1974-75) found that
the eye-voice span was longer for the older readers, suggesting that
the older readers attempted to decode the larger units of meaning.

The work of Wisher (1976, 1977) provides further evidence that the
reader uses his understanding of syntax "to parse word strings into
convenient processing units" (p. 601). It is likely that
understanding or semantic integration occurs between phrases and
clauses (more likely clauses, but that discussion is beyond the realm
of this paper). Further support is provided by Fodor and Bever (1965)
who found that listeners group words (for understanding) according to
the syntax of the sentence.

The importance of being able to read phrases has been discussed by a
number of reading educators including Bond and Tinker (1975), Harris
and Sipay (1975), Heilman (1972), Heilman and Holmes (1972), and
Zintz (1975); they believe that good readers organize the text they
read into meaningful units such as phrases. However, many poor
readers do not do this and comprehension is poor even when they have
been pre-taught each individual word in the selection (Oaken, Weiner,
and Cromer, 1971), and it has been found that training in the reading
of phrases has improved the reading of remedial students (Amble,
1967).

Phrases, Clauses, and Word Strings

In our earliest efforts we were interested in identifying common,
reoccurring phrases such as prepositional phrases.² For reasons which
will be explained later, those efforts were unsuccessful so we
resorted to identifying common word strings. At this point a
discussion of what is meant by phrases, clauses and word strings is
appropriate. In the sentence below there is a noun phrase (Little
children) followed by a verb phrase (were playing) which is followed
by a prepositional phrase (in the park).

    Little children were playing in the park.

The noun phrase and the verb phrase (Little children were playing)
form a main or independent clause. Any group of consecutive words
(Little children, Little children were, children were playing, were
playing in, playing in the, in the park, and so on) constitutes a
word string.

We were, and are, primarily interested in common word groups which we
expected to be true phrases according to a traditional grammarian's
definition. However, a more appropriate descriptor for the word
groups we identified is the term "word string" (or word strings).

² While transformational grammar theory does not provide for the
categorization of phrases according to parts of speech, we found the
traditional labels useful. In transformational grammar, a sentence
may be divided into a noun phrase and a verb phrase. Additional
information in this area may be found in DeStefano (1978) and Jacobs
and Rosenbaum (1968).
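The definition of a word string given in this section can be
illustrated with a minimal sketch. This is not the program described
in this paper; the function name word_strings and the limit of two to
three words per string are assumptions made only for the example.

```python
def word_strings(sentence, min_len=2, max_len=3):
    """Return every run of consecutive words of length min_len..max_len.

    Hypothetical helper for illustration; the length limits are assumed.
    """
    # Drop the final period and split on whitespace.
    words = sentence.rstrip(".").split()
    strings = []
    for n in range(min_len, max_len + 1):
        # Slide a window of n words across the sentence.
        for start in range(len(words) - n + 1):
            strings.append(" ".join(words[start:start + n]))
    return strings


example = word_strings("Little children were playing in the park.")
```

For the seven-word example sentence this yields six two-word strings
and five three-word strings, among them "Little children" and "in the
park".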

Early Efforts at Parsing Phrases

Because we wanted to be able to analyze large amounts of text
(initially we felt at least [illegible] words), the application of
computer technology was a critical part of our work. The fact that
certain kinds of analyses may be accomplished through the use of
computers has been demonstrated (Kucera and Francis, 1967; Carroll,
Davies and Richman, 1971; Harris and Jacobson, 1972; Moe, 1973; and
Hopkins and Moe, 1975). However, this study required programming of a
somewhat different nature.

We identified five types of phrases (prepositional, participial,
gerund, infinitive and verb) which commonly appear in written
materials. Since it was anticipated that prepositional phrases could
be identified by the computer with a high degree of accuracy we
worked with a computer programmer to develop such a program. By
programming the computer to locate all prepositions (with a list of
52 prepositions stored in the computer's memory) and then parse out
the preposition and the two-word string which followed it, we found
that indeed it was possible for the computer to identify these
three-word strings with 99% accuracy. That is, we only missed about
1% of the prepositional phrases.
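The approach just described can be sketched as follows. This is a
minimal illustration under stated assumptions, not the original
program: the short PREPOSITIONS set is a hypothetical stand-in for
the list of 52 prepositions (which the paper does not reproduce), and
the function name preposition_strings is invented for the example.

```python
import re

# Hypothetical short list; the original program stored 52 prepositions,
# which are not listed in the paper.
PREPOSITIONS = {"in", "on", "of", "to", "with", "by", "at", "from"}


def preposition_strings(text, prepositions=PREPOSITIONS):
    """Return each listed preposition plus the two words that follow it."""
    # Lowercase the text and keep only runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    found = []
    # Stop two words early so every match has two following words.
    for i, word in enumerate(words[:-2]):
        if word in prepositions:
            found.append(" ".join(words[i:i + 3]))
    return found


sample = "Little children were playing in the park. They wanted to go home."
candidates = preposition_strings(sample)
```

For this sample the sketch returns "in the park" and "to go home".
The second string begins with a listed word but is an infinitive, not
a prepositional phrase.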



A problem arose, however, in that even though these three-word
strings began with a preposition, they did not all function as
prepositional phrases. We then eliminated prepositions from the list
which rarely seemed to function as the first word in prepositional
phrases. After many program revisions and trial runs we were able to
parse out almost all (97-99%) of the prepositional phrases out of the
text. However, we were still parsing out many word strings which were
not prepositional phrases. And when we examined the strings which had
been parsed out, only about 62% were actual prepositional phrases; we
found this level of accuracy to be unacceptable.
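The two percentages above measure different things: the share of true
prepositional phrases that were found (97-99%) and the share of
parsed-out strings that were true prepositional phrases (about 62%).
In later terminology, which this paper does not use, these are often
called recall and precision. The sketch below shows the arithmetic on
a hypothetical hand-labeled toy example; the data and the function
name precision_recall are assumptions, not the study's corpus or
program.

```python
def precision_recall(extracted, true_phrases):
    """Compare parsed-out strings with hand-labeled true phrases."""
    extracted, true_phrases = set(extracted), set(true_phrases)
    hits = extracted & true_phrases
    # Share of parsed-out strings that are true phrases.
    precision = len(hits) / len(extracted) if extracted else 0.0
    # Share of true phrases that were parsed out.
    recall = len(hits) / len(true_phrases) if true_phrases else 0.0
    return precision, recall


# Hypothetical toy data for illustration only.
extracted = ["in the park", "to go home", "on the table", "of course not"]
true_phrases = ["in the park", "on the table"]
precision, recall = precision_recall(extracted, true_phrases)
```

In this toy case every true phrase is found, yet only half of the
parsed-out strings are true phrases, which is the same kind of gap
reported above.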

Later Efforts

We then decided to approach the problem of identifying common phrases
from a completely new perspective. Rather than categorizing phrases
by parts of speech, another computer program was developed which
identified every consecutive two- and three-word sequence found in
the written text, stored it in memory, and, at the end of all text
input, tabulated all possible two- and three-word strings.
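A minimal sketch of this kind of tabulation is given below. It is an
illustration only, not the program the authors used: the function
name tabulate_word_strings, the tokenizing rule, and the decision to
let strings run across sentence boundaries are all assumptions.

```python
import re
from collections import Counter


def tabulate_word_strings(text, sizes=(2, 3)):
    """Count every consecutive two- and three-word sequence in text."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    # Punctuation is dropped, so strings may span sentence boundaries.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in sizes:
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts


sample = ("The children played in the park. "
          "The dog ran in the park and sat in the shade.")
counts = tabulate_word_strings(sample)
most_common = counts.most_common(3)
```

In this short sample "in the" occurs three times and "in the park"
twice, so the function-word pair rises to the top of the tabulation.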

Through much trial and error, the investigators were able to develop
a program that parsed out common word strings which are, by
traditional definitions, actual phrases or which are the first two or
three words of an actual phrase. Some of the common word strings
identified, however, cannot be categorized by traditional definitions
(and are, therefore, simply referred to as common strings). Once the
new program was operational a corpus of 15,000 words, analyzed
previously with the old program, was reanalyzed. This analysis led us
to decide that if we were going to make claims that we had identified
common word strings in written text, many, many more samples of
written text needed to be analyzed. In order to eliminate the
idiosyncrasies of text sampling we believe that large amounts of
text, over 50,000 words, should be used in subsequent analyses.

Major Implications

There appears to be little disagreement that the more able readers
process larger chunks of text more rapidly than the less able
readers. And it is agreed, we think, that our instructional practices
should be such that our students are led to the point where they may,
with a single fixation, read whole phrases of two or three or four
words. How children should be brought to this point, however, may be
a debatable issue among reading educators. We believe that common
phrases should be taught in much the same manner in which common
words are taught and a suggested procedure is presented here.

If we know that "in the" and "of the", for example, are common word
strings in text then it seems reasonable that they be taught as a
group with a noun found in the text the students are to read. Since
"in", "the", and "of" become part of a reader's sight vocabulary very
early, they will already be familiar to the student. The task is to
help the student to read the function word(s) and the content word,
which may or may not be a part of the student's sight vocabulary,
quickly.

Assume, for example, that the student has a sight vocabulary of
approximately 100 words and that the words "street" and "pond" are to
be introduced as new words in a lesson. The concepts or the meaning
of "street" and "pond" will be discussed with the student by the
teacher. Then the teacher will present the printed form of the word
(either in isolation or in context). If the student is to become a
rapid reader, and a rapid comprehender of text, then the reader
should be able to read the phrases "in the street" and "in the pond"
quickly since the meaning of the phrase is not in the word "in" or in
the word "the" but primarily in the word "street" and more completely
in the phrase itself. A similar case may be made for the presentation
of larger chunks such as clauses and the procedures would be much the
same.

Our purpose was to develop a system to identify common word strings.
Since students must go beyond the word level in beginning reading, we
believe that the use of common word strings found in text will
facilitate the reader's ability to handle larger and larger units of
text.






REFERENCES 






Amble, B. R. "Reading by Phrases," California Journal of Educational
    Research, 1967, 18: 116-124.

Bond, G. L. and Tinker, M. A. Reading Difficulties: Their Diagnosis
    and Correction, third edition. New York: Appleton-Century-Crofts,
    1975.

Carroll, J. B.; Davies, P.; and Richman, B. American Heritage Word
    Frequency Book. Boston: Houghton Mifflin, 1971.

Dale, E. and Chall, J. S. "A Formula for Predicting Readability,"
    Educational Research Bulletin, Ohio State University, 1948, 27:
    11-20; 28: 37-54.

DeStefano, J. S. Language, The Learner and the School. New York:
    John Wiley and Sons, Inc., 1978.

Dolch, E. W. Sight Phrase Cards. Champaign, Illinois: Garrard
    Publishing Company, 1948.

Fodor, J. A. and Bever, T. G. "The Psychological Reality of Linguistic
    Segments," Journal of Verbal Learning and Verbal Behavior, 1965,
    4: 414-420.

Harris, A. J. and Jacobson, M. D. Basic Elementary Reading
    Vocabularies. New York: Macmillan, 1972.

Harris, A. J. and Sipay, E. R. How to Increase Reading Ability, sixth
    edition. New York: David McKay, 1975.

Heilman, A. W. Principles and Practices of Teaching Reading, third
    edition. Columbus: Charles E. Merrill, 1972.

Heilman, A. W. and Holmes, A. H. Smuggling Language Into the Teaching
    of Reading. Columbus: Charles E. Merrill, 1972.

Hopkins, C. J. and Moe, A. J. "The Validation of a Synthetic Syllable
    Count Appropriate for Computer-Determined Readability Estimates."
    Paper presented at the meeting of the International Reading
    Association, New York City, May, 1975.

Jacobs, R. and Rosenbaum, P. S. English Transformational Grammar.
    Waltham, Massachusetts: Blaisdell Publishing Co., 1968.

Kucera, H. and Francis, W. N. Computational Analysis of Present-Day
    American English. Providence: Brown University Press, 1967.

LaBerge, D. and Samuels, S. J. "Toward a Theory of Automatic
    Information Processing in Reading," Cognitive Psychology, 1974,
    6: 293-323.



Moe, A. J. "Word Lists for Beginning Readers," Reading Improvement,
    Fall, 1973, 10: 11-15.

Oaken, R.; Weiner, M.; and Cromer, W. "Identification, Organization
    and Reading Comprehension for Good and Poor Readers," Journal of
    Educational Psychology, 1971, 62: 71-78.

Rode, S. S. "Development of Phrase and Clause Boundary Reading in
    Children," Reading Research Quarterly, 1974-75, 10: 124-142.

Samuels, S. J. "Automatic Decoding and Reading Comprehension,"
    Language Arts, March, 1976, 53: 323-325.

Smith, F. Understanding Reading: A Psycholinguistic Analysis of
    Reading and Learning to Read. New York: Holt, Rinehart and
    Winston, 1971.

Taylor, S. E.; Frackenpohl, H.; and Pettee, J. L. Grade Level Norms
    for the Components of the Fundamental Reading Skill. Bulletin #3,
    Huntington, New York: Educational Development Laboratories, 1960.

Thorndike, E. L. and Lorge, I. The Teacher's Word Book of 30,000
    Words. New York: Teachers College Press, Columbia University,
    1944.

Wisher, R. A. "The Effects of Syntactic Expectation During Reading,"
    Journal of Educational Psychology, 1976, 68: 597-602.

Wisher, R. A. "Linguistic Expectations and Memory Limitations in
    Reading." Paper presented at the twenty-second annual convention
    of the International Reading Association, Miami Beach, May, 1977.

Zintz, M. V. The Reading Process, second edition. Dubuque: Wm. C.
    Brown, 1975.



